NSF PAR Search | NSF Public Access Repository

Protein Structure Tokenization via Geometric Byte Pair Encoding

Sun, Michael; Yuan, Weize; Liu, Gang; Matusik, Wojciech; Zitnik, Marinka (May 2026, ICLR)

Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss. GeoBPE offers compression (>10x reduction in bits-per-residue at similar distortion rate), data efficiency (>10x less training data), and generalization (maintains test/train distortion ratio of 1.0−1.1). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across 12 tasks and 24 test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs.
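Steps (i) and (ii) of the abstract, clustering with k-medoids and quantizing each item to its closest medoid, can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D tuples below are hypothetical stand-ins for Geo-Pair descriptors, the function names `k_medoids` and `quantize` are invented for illustration, plain Euclidean distance is an assumption, and the drift-correction step (iii) is omitted entirely.

```python
import math
import random


def k_medoids(points, k, iters=20, seed=0):
    """Simple alternating k-medoids; returns k prototype points.

    Assumes the input points are distinct, so every medoid keeps
    at least itself in its cluster.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(math.dist(points[c], points[j])
                                  for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]


def quantize(points, prototypes):
    """Map each point to the index (token id) of its closest prototype."""
    return [
        min(range(len(prototypes)),
            key=lambda t: math.dist(p, prototypes[t]))
        for p in points
    ]


# Hypothetical toy descriptors forming two well-separated groups.
descriptors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
prototypes = k_medoids(descriptors, k=2)
tokens = quantize(descriptors, prototypes)
print(prototypes, tokens)
```

In this toy setting, `k` plays the role of the resolution knob: a larger `k` gives a finer vocabulary at the cost of more tokens to represent.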
Full Text Available
Vascular smooth muscle cells can be circumferentially aligned inside a channel using tunable gelatin microribbons

https://doi.org/10.1088/1758-5090/ad88a7

Mastoor, Yusuf; Karimi, Mahsa; Sun, Michael; Ahadi, Fereshteh; Mathieu, Pattie; Fan, Mingyue; Han, Lin; Han, Li-Hsin; Clyne, Alisa Morss (October 2024, Biofabrication)

The gold standard to measure arterial health is vasodilation in response to nitric oxide. Vasodilation is generally measured via pressure myography of arteries isolated from animal models. However, animal arteries can be difficult to obtain and may have limited relevance to human physiology. It is, therefore, critical to engineer human cell-based arterial models capable of contraction. Vascular smooth muscle cells (SMCs) must be circumferentially aligned around the vessel lumen to contract the vessel, which is challenging to achieve in a soft blood vessel model. In this study, we used gelatin microribbons to circumferentially align SMCs inside a hydrogel channel. To accomplish this, we created tunable gelatin microribbons of varying stiffnesses and thicknesses and assessed how SMCs aligned along them. We then wrapped soft, thick microribbons around a needle and encapsulated them in a gelatin methacryloyl hydrogel, forming a microribbon-lined channel. Finally, we seeded SMCs inside the channel and showed that they adhered best to fibronectin and circumferentially aligned in response to the microribbons. Together, these data show that tunable gelatin microribbons can be used to circumferentially align SMCs inside a channel. This technique can be used to create a human artery-on-a-chip to assess vasodilation via pressure myography, as well as to align other cell types for 3D in vitro models.
Full Text Available
The combined importance of finite dimensions, anisotropy, and pre-stress in acoustoelastography

https://doi.org/10.1121/10.0010110

Crutison, Joseph; Sun, Michael; Royston, Thomas J. (April 2022, The Journal of the Acoustical Society of America)

Dynamic elastography, whether based on magnetic resonance, ultrasound, or optical modalities, attempts to reconstruct quantitative maps of the viscoelastic properties of biological tissue, properties that are altered by disease and injury, by noninvasively measuring mechanical wave motion in the tissue. Most reconstruction strategies that have been developed neglect boundary conditions, including quasistatic tensile or compressive loading resulting in a nonzero prestress. Significant prestress is inherent to the functional role of some biological tissues currently being studied using elastography, such as skeletal and cardiac muscle, arterial walls, and the cornea. In the present article, we review how prestress alters both bulk mechanical wave motion and wave motion in one- and two-dimensional waveguides. Key findings are linked to studies on skeletal muscle and the human cornea, as one- and two-dimensional waveguide examples. This study highlights the underappreciated combined acoustoelastic and waveguide challenge to elastography. Can elastography truly determine viscoelastic properties of a material when what it is measuring is affected by both these material properties and unknown prestress and other boundary conditions?
Full Text Available
X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Moradshahi, Mehrad; Shen, Tianhao; Bali, Kalika; Choudhury, Monojit; de Chalendar, Gaël; Goel, Anmol; Kim, Sungkyun; Kodali, Prashant; Kumaraguru, Ponnurangam; Semmar, Nasredine; et al (July 2023, Findings of the Association for Computational Linguistics (ACL), Toronto, Canada, 2023)

Task-oriented dialogue research has mainly focused on a few popular languages like English and Chinese, due to the high dataset creation cost for a new language. To reduce the cost, we apply manual editing to automatically translated data. We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language, and unlike most multilingual prior work, is an end-to-end dataset for building fully-functioning agents. The many difficulties we encountered in creating X-RiSAWOZ led us to develop a toolset to accelerate the post-editing of a new language dataset after translation. This toolset improves machine translation with a hybrid entity alignment technique that combines neural with dictionary-based methods, along with many automated and semi-automated validation checks. We establish strong baselines for X-RiSAWOZ by training dialogue agents in the zero- and few-shot settings where limited gold data is available in the target language. Our results suggest that our translation and post-editing methodology and toolset can be used to create new high-quality multilingual dialogue agents cost-effectively. Our dataset,
Full Text Available